ASR for South Slavic Languages Developed in Almost Automated Way

Authors

  • Jan Nouza
  • Radek Safarík
  • Petr Cerva
Abstract

Slavic languages pose several specific challenges that need to be addressed in ASR system design. Since we have already built an engine suited to highly inflected languages, we now focus on adapting it to new languages. Here, we present an efficient way to adapt the system to all seven South Slavic languages, using methods and tools that benefit from language similarities, easily adjustable G2P rules or common phonetic subsets. We show that it is possible to build accurate language and acoustic models in an almost automated way, entirely from resources found on the web. The acoustic models (AMs) are trained via cross-lingual bootstrapping followed by lightly supervised retraining on public data, such as broadcast and parliament archives. Tests on a set of main broadcast news programs in each language show WER values in the range of 16.8% to 21.5%, which also includes errors caused by out-of-language (OOL) utterances that often occur in this type of spoken program.
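
For readers unfamiliar with the metric, the word error rate (WER) quoted in the abstract is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition only; the abstract does not describe the authors' scoring tool or text normalization, so the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length.

    Assumption: words are separated by whitespace and compared exactly;
    real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, an out-of-language utterance that the recognizer transcribes incorrectly contributes errors like any other misrecognized words, which is consistent with the abstract's remark that the reported figures include OOL errors.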

Similar resources

Speech Technologies for Serbian and Kindred South Slavic Languages

This chapter will present the results of the research and development of speech technologies for Serbian and other kindred South Slavic languages used in five countries of the Western Balkans, carried out by the University of Novi Sad, Serbia in cooperation with the company AlfaNum. The first section will describe particularities of highly inflected languages (such as Serbian and other language...

A smartphone-based ASR data collection tool for under-resourced languages

Acoustic data collection for automatic speech recognition (ASR) purposes is a particularly challenging task when working with under-resourced languages, many of which are found in the developing world. We provide a brief overview of related data collection strategies, highlighting some of the salient issues pertaining to collecting ASR data for under-resourced languages. We then describe the dev...

PIE inheritance and word-formational innovation in Slavic motion verbs in -i-

The unprefixed imperfective verbs of motion with present tense in -i (such as Russian vodit’, vozit’, bežat’), most of which are considered indeterminate in the modern languages, developed over a lengthy period from Proto-Indo-European to the disintegration of Proto-Slavic. The final period of their development in Slavic shows striking innovation in the formal and semantic structures, including...

Genetic Heritage of the Balto-Slavic Speaking Populations: A Synthesis of Autosomal, Mitochondrial and Y-Chromosomal Data

The Slavic branch of the Balto-Slavic sub-family of Indo-European languages underwent rapid divergence as a result of the spatial expansion of its speakers from Central-East Europe in early medieval times. This expansion, mainly to East Europe and the northern Balkans, resulted in the incorporation of genetic components from numerous autochthonous populations into the Slavic gene pools. Here, we...

Language Related Issues for Machine Translation between Closely Related South Slavic Languages

Machine translation between closely related languages is less challenging and exhibits a smaller number of translation errors than translation between distant languages, but there are still obstacles which should be addressed in order to improve such systems. This work explores the obstacles for machine translation systems between closely related South Slavic languages, namely Croatian, Serbian...

Publication date: 2016